Data Processing (Statistical Methods)
Measures Of Central Tendency
Measures of central tendency are statistical values that represent the center or typical value of a dataset. They aim to summarize a set of data with a single value that is most representative of the entire set.
Mean
Definition: The arithmetic average of all values in a dataset. It is calculated by summing all the values and dividing by the total number of values.
Formula (for a sample):
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
where:
- $\bar{x}$ (x-bar) is the sample mean
- $\sum$ (sigma) is the summation symbol
- $x_i$ represents each individual value in the dataset
- $n$ is the total number of values in the sample
Formula (for a population):
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
where:
- $\mu$ (mu) is the population mean
- $N$ is the total number of values in the population
Characteristics: Affected by outliers (extreme values). Used when data is numerical and symmetrically distributed.
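The sample-mean formula above can be computed directly; a minimal sketch using Python's standard library (the data values are hypothetical):

```python
from statistics import mean

data = [2, 3, 5, 7, 13]          # hypothetical sample
x_bar = sum(data) / len(data)    # sum of all values divided by n
assert x_bar == mean(data)       # the stdlib function gives the same result
print(x_bar)                     # 6.0

# A single outlier pulls the mean sharply upward:
print(mean(data + [100]))        # 21.666...
```

The same arithmetic applies to a population mean; only the interpretation of $n$ versus $N$ differs.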
Median
Definition: The middle value in a dataset that has been arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.
Calculation:
- Arrange the data in ascending or descending order.
- If $n$ (the number of observations) is odd, the median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ value.
- If $n$ is even, the median is the average of the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2}+1\right)^{\text{th}}$ values.
Characteristics: Not affected by outliers. Useful for skewed data or when dealing with ordinal data.
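The odd/even cases in the calculation steps above can be sketched as follows, with `statistics.median` as a cross-check (datasets are hypothetical):

```python
from statistics import median

def median_manual(values):
    s = sorted(values)                    # step 1: order the data
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                        # odd n: take the (n+1)/2-th value
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2      # even n: average the two middle values

odd = [7, 1, 3, 9, 5]                     # sorted: 1 3 5 7 9 -> median 5
even = [7, 1, 3, 9]                       # sorted: 1 3 7 9  -> median (3+7)/2
print(median_manual(odd), median(odd))    # 5 5
print(median_manual(even), median(even))  # 5.0 5.0

# Unaffected by an extreme value:
print(median(odd + [1000]))               # 6.0
```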
Mode
Definition: The value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or multiple modes (multimodal).
Calculation: Simply count the frequency of each value and identify the one with the highest frequency.
Characteristics: Can be used for both numerical and categorical data. It may not be unique, and it's not affected by outliers.
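The frequency-counting procedure above is a few lines in Python; `statistics.multimode` handles the bimodal and categorical cases directly (example data are hypothetical):

```python
from collections import Counter
from statistics import multimode

nums = [2, 2, 3, 3, 5]
colors = ["red", "blue", "red", "green"]

# Count frequencies and keep every value tied for the highest count
counts = Counter(nums)
top = max(counts.values())
modes = [v for v, c in counts.items() if c == top]
print(modes)              # [2, 3]  (bimodal)
print(multimode(nums))    # [2, 3]  stdlib equivalent
print(multimode(colors))  # ['red'] works for categorical data too
```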
Comparison Of Mean, Median And Mode
| Feature | Mean | Median | Mode |
|----------------------|---------------------------------------|-------------------------------------------|-----------------------------------------|
| Calculation | Sum of values / number of values | Middle value in ordered data | Most frequent value |
| Affected by Outliers | Yes | No | No |
| Data Type | Numerical (Interval/Ratio) | Numerical (Ordinal/Interval/Ratio) | Numerical or Categorical (Nominal) |
| Uniqueness | Unique | Unique (or average of two middle values) | May not be unique, or may not exist |
| Skewed Data | Highly affected, may not represent center | Less affected, often a better indicator | Can be anywhere in the distribution |
| Best Used When | Data is symmetrical, no extreme outliers | Data is skewed, or outliers are present | Data has a clear peak or most common value |
Symmetrical Distribution: Mean, Median, and Mode are typically close to each other.
Positively Skewed (Tail to the right): Mean > Median > Mode
Negatively Skewed (Tail to the left): Mode > Median > Mean
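The ordering for a positively skewed distribution can be verified numerically; a sketch with hypothetical right-tailed data:

```python
from statistics import mean, median, mode

# Hypothetical data with a long right tail (positively skewed)
skewed = [20, 22, 22, 25, 30, 35, 90]
print(mode(skewed))    # 22
print(median(skewed))  # 25
print(mean(skewed))    # 34.857...
# Mean > Median > Mode, as expected when the tail points to the right
```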
Measures Of Dispersion
Measures of dispersion (or variability) describe how spread out or scattered a set of data is. They quantify the variability within a dataset, complementing measures of central tendency.
Range
Definition: The simplest measure of dispersion. It is the difference between the highest and lowest values in a dataset.
Formula:
Range = Maximum Value - Minimum Value
Characteristics:
- Easy to Calculate: Very simple to compute.
- Affected by Outliers: Highly sensitive to extreme values. A single very large or very small value can greatly inflate the range.
- Limited Information: Only uses two values in the dataset, ignoring the distribution of the rest of the data.
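The range and its sensitivity to a single extreme value, in a short sketch (data are hypothetical):

```python
data = [4, 8, 15, 16, 23, 42]
rng = max(data) - min(data)   # Range = Maximum Value - Minimum Value
print(rng)                    # 38

# One extreme value inflates the range even though the rest of the data
# is unchanged:
print(max(data + [500]) - min(data))  # 496
```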
Standard Deviation
Definition: A measure of the typical amount of variability or spread in a dataset, computed as the square root of the average squared deviation of the data points from the mean. A low standard deviation indicates that the data points tend to be close to the mean, while a high standard deviation indicates that the data points are spread out over a wider range of values.
Formula (for a sample):
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
where:
- $s$ is the sample standard deviation
- $x_i$ is each individual value
- $\bar{x}$ is the sample mean
- $n$ is the sample size
- $(n-1)$ is used in the denominator for sample standard deviation (Bessel's correction) to provide a less biased estimate of the population standard deviation.
Formula (for a population):
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
where:
- $\sigma$ (sigma) is the population standard deviation
- $N$ is the population size
- $\mu$ is the population mean
Characteristics: Takes into account all values in the dataset. Has the same units as the data. Widely used and forms the basis for many statistical tests.
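Both formulas above, implemented side by side to show the effect of Bessel's correction (dividing by $n-1$ versus $N$); the dataset is hypothetical, and `statistics.stdev`/`pstdev` serve as cross-checks:

```python
from statistics import mean, pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]
x_bar = mean(data)  # 5.0

# Sample standard deviation: divide the squared deviations by n - 1
s = (sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)) ** 0.5
# Population standard deviation: divide by N
sigma = (sum((x - x_bar) ** 2 for x in data) / len(data)) ** 0.5

print(round(s, 4), round(stdev(data), 4))       # 2.1381 2.1381
print(round(sigma, 4), round(pstdev(data), 4))  # 2.0 2.0
```

Note that the sample estimate is always slightly larger than the population value for the same numbers, because the denominator is smaller.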
Coefficient Of Variation (CV)
Definition: A statistical measure of the relative dispersion of data points in a data series around the mean. It is defined as the ratio of the standard deviation to the mean, expressed as a percentage.
Formula:
$CV = \left( \frac{\sigma}{\mu} \right) \times 100\%$
(Using population standard deviation $\sigma$ and mean $\mu$, or sample standard deviation $s$ and sample mean $\bar{x}$ if comparing samples).
Characteristics:
- Unitless: It is a relative measure, making it useful for comparing the variability of datasets with different units or means.
- Standardization: Provides a standardized way to compare dispersion across different datasets. A lower CV indicates less relative variability, meaning the data is more consistently clustered around the mean.
- Use: Useful in finance, biology, and other fields to compare the volatility or variability of different measurements.
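Because the CV is unitless, it lets us compare spread across datasets measured in different units; a sketch with hypothetical height and weight samples (sample $s$ over sample $\bar{x}$):

```python
from statistics import mean, stdev

# Hypothetical measurements in different units
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 60, 70, 80, 95]

def cv(values):
    # Coefficient of variation as a percentage
    return stdev(values) / mean(values) * 100

print(round(cv(heights_cm), 2))  # smaller CV: heights vary less, relatively
print(round(cv(weights_kg), 2))  # larger CV: weights vary more, relatively
```

The raw standard deviations are not directly comparable (centimetres versus kilograms), but the CVs are.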
Measures Of Relationship
Measures of relationship quantify the association or dependence between two or more variables. They help us understand if variables tend to change together and how strongly.
Direction Of Correlation
Definition: Indicates whether the variables tend to move in the same direction or in opposite directions.
Types:
- Positive Correlation: As one variable increases, the other tends to increase. As one decreases, the other tends to decrease. (e.g., height and weight, study hours and exam scores).
- Negative Correlation: As one variable increases, the other tends to decrease, and vice versa. (e.g., price and demand, temperature and heating costs).
- Zero or No Correlation: There is no apparent systematic relationship between the variables.
Measure: The sign of the correlation coefficient indicates the direction: a positive coefficient indicates positive correlation and a negative coefficient indicates negative correlation, while a coefficient close to 0 indicates little or no correlation.
Degree Of Correlation
Definition: Quantifies the strength of the linear relationship between two variables. It measures how closely the data points cluster around a straight line.
Measure: The Pearson correlation coefficient (r) is the most common measure. It ranges from -1 to +1.
- $r = +1$: Perfect positive linear correlation. All data points lie exactly on a straight line with a positive slope.
- $r = -1$: Perfect negative linear correlation. All data points lie exactly on a straight line with a negative slope.
- $r = 0$: No linear correlation. The variables are not linearly related.
- Values between -1 and +1: Indicate varying degrees of linear relationship.
- $r$ close to +1: Strong positive linear correlation.
- $r$ close to -1: Strong negative linear correlation.
- $r$ close to 0: Weak or no linear correlation.
Interpretation: The magnitude of $r$ (ignoring the sign) indicates the strength of the linear relationship (e.g., $r = 0.8$ is a stronger linear relationship than $r = 0.3$).
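Pearson's $r$ is the covariance of the two variables divided by the product of their standard deviations; a minimal implementation illustrating the $+1$, $-1$, and near-$+1$ cases described above (the hours/scores data are hypothetical):

```python
from statistics import mean

def pearson_r(xs, ys):
    # r = covariance / (spread of x * spread of y), from raw sums
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

hours = [1, 2, 3, 4, 5]
scores = [52, 58, 63, 70, 77]   # hypothetical exam scores

print(pearson_r(hours, hours))                    # 1.0  (perfect positive)
print(pearson_r(hours, [-h for h in hours]))      # -1.0 (perfect negative)
print(round(pearson_r(hours, scores), 3))         # 0.998: strong positive
```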
Spearman’s Rank Correlation
Definition: A non-parametric measure of rank correlation that assesses how well the relationship between two variables can be described using a monotonic function. It is used when the data is ordinal or when the assumptions for Pearson's correlation are not met (e.g., non-linear relationships or outliers).
Calculation:
- Rank the data for each variable separately.
- Calculate the difference ($d_i$) between the ranks for each pair of observations.
- Square these differences ($d_i^2$).
- Sum the squared differences ($\sum d_i^2$).
- Apply the formula for Spearman's rank correlation coefficient ($\rho$ or $r_s$):
$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$
where:
- $d_i$ is the difference between the ranks of paired scores
- $n$ is the number of observations
Characteristics:
- Range: Like Pearson's $r$, it ranges from -1 to +1.
- Monotonic Relationship: Detects monotonic relationships, where one variable consistently increases (or consistently decreases) as the other increases, though not necessarily at a constant rate.
- Rank-Based: Less sensitive to extreme values (outliers) than Pearson's correlation because it works with ranks.
- Usefulness: Particularly useful when dealing with ordinal data or when the relationship is monotonic but not strictly linear.
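The five calculation steps above can be sketched directly; this minimal implementation assumes no tied ranks (ties require averaged ranks, which the simple formula does not handle), and the example data are hypothetical:

```python
def spearman_rho(xs, ys):
    def ranks(values):
        # Rank each value (1 = smallest); assumes no ties for simplicity
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank diffs
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A monotonic but non-linear relationship still gives rho = 1:
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]        # y = x**3
print(spearman_rho(xs, ys))     # 1.0
```

Because the cubic relationship is monotonic, Spearman's $\rho$ is exactly 1 even though Pearson's $r$ for the same data would be less than 1.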